Abstract

We consider estimating the Shannon entropy of a discrete distribution $P$ from $n$ i.i.d. samples. Recently, Jiao, Venkat, Han, and Weissman, and Wu and Yang constructed approximation theoretic estimators that achieve the minimax $L_2$ rates in estimating entropy. Their estimators are consistent given $n \gg S/\ln S$ samples, where $S$ is the alphabet size, and this is the best possible sample complexity. In contrast, the Maximum Likelihood Estimator (MLE), which is the empirical entropy, requires $n \gg S$ samples.

In the present paper we significantly refine the minimax results of existing work. To alleviate the pessimism of minimaxity, we adopt the adaptive estimation framework, and show that the minimax rate-optimal estimator in Jiao, Venkat, Han, and Weissman achieves the minimax rates simultaneously over a nested sequence of subsets of distributions $P$, without knowing the alphabet size $S$ or which subset $P$ lies in. In other words, their estimator is adaptive with respect to this nested sequence of the parameter space, which is characterized by the entropy of the distribution. We also characterize the maximum risk of the MLE over this nested sequence, and show, for every subset in the sequence, that the performance of the minimax rate-optimal estimator with $n$ samples is essentially that of the MLE with $n \ln n$ samples, thereby further substantiating the generality of the phenomenon discovered by Jiao, Venkat, Han, and Weissman.


Index Terms 

adaptive estimation, entropy estimation, best polynomial approximation, high dimensional statistics, large alphabet, minimax 
optimality 


I. Introduction 

Shannon entropy $H(P)$, defined as

$$H(P) = \sum_{i=1}^{S} p_i \ln \frac{1}{p_i}, \tag{1}$$

is one of the most fundamental quantities of information theory and statistics, which emerged in Shannon's 1948 masterpiece [1] as the answer to foundational questions of compression and communication.

Consider the problem of estimating Shannon entropy $H(P)$ from $n$ i.i.d. samples. Classical theory is mainly concerned with the case where the number of samples $n \to \infty$, while the alphabet size $S$ is fixed. In that scenario, the maximum likelihood estimator (MLE), $H(P_n)$, which plugs the empirical distribution into the definition of entropy, is asymptotically efficient [2, Thm. 8.11, Lemma 8.14] in the sense of the Hájek convolution theorem [3] and the Hájek-Le Cam local asymptotic minimax theorem [4]. It is therefore not surprising to encounter the following quote from the introduction of Wyner and Foster [5], who considered entropy estimation:

“The plug-in estimate is universal and optimal not only for finite alphabet i.i.d. sources but also for finite 
alphabet, finite memory sources. On the other hand, practically as well as theoretically, these problems are 
of little interest. ” 

In contrast, various modern data-analytic applications deal with datasets which do not fall into the regime of fixed alphabet and $n \to \infty$. In fact, in many applications the alphabet size $S$ is comparable to, or even larger than, the number of samples $n$. For example:

• Corpus linguistics: about half of the words in the Shakespearean canon appeared only once [6].

• Network traffic analysis: many customers or website users are seen a small number of times [7].

• Analyzing neural spike trains: natural stimuli generate neural responses of high timing precision resulting in a massive space of meaningful responses [8]-[10].
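To make the undersampled regime concrete, the following is a minimal Python sketch of the plug-in (empirical) entropy. The function name `plugin_entropy` and the toy data are assumptions introduced only for illustration. Since an empirical distribution built from $n$ samples has at most $n$ atoms, the plug-in value can never exceed $\ln n$, so it necessarily underestimates the entropy $\ln S$ of a uniform source whenever $n < S$.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Empirical (plug-in / MLE) entropy in nats: H(P_n)."""
    n = len(samples)
    counts = Counter(samples)
    # Sum -phat * ln(phat) over the symbols that were actually observed.
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical data: four samples, each symbol seen exactly once.
samples = ["a", "b", "c", "d"]
estimate = plugin_entropy(samples)  # equals ln(4) regardless of the true alphabet size
```

When every symbol is seen once, the estimate equals $\ln n$ no matter how large the true alphabet is, which is one way to see why the plug-in approach needs $n$ at least comparable to $S$.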


A. Existing literature 

The problem of entropy estimation in the large alphabet regime (or non-asymptotic analysis) has been investigated extensively in various disciplines; we refer to [11] for a detailed review. One recent breakthrough in this direction came from Valiant and Valiant [12], who constructed the first explicit entropy estimator whose sample complexity is $n \asymp S/\ln S$ samples, which they also proved to be necessary. It was also shown in [13], [14] that the MLE requires $n \asymp S$ samples, implying that the MLE is strictly sub-optimal in terms of sample complexity.

However, the aforementioned estimators have not been shown to achieve the minimax $L_2$ rates. In light of this, Jiao et al. [11] and Wu and Yang [15] independently developed schemes based on approximation theory, and obtained the minimax $L_2$ convergence rates for the entropy. Furthermore, Jiao et al. [11] proposed a general methodology for estimating functionals, and showed that for a wide class of functionals (including entropy, mutual information, and Rényi entropy), their methodology can construct minimax rate-optimal estimators whose performance with $n$ samples is essentially that of the MLE with $n \ln n$ samples. They also obtained minimax $L_2$ rates for estimating a large class of functionals. On the practical side, Jiao et al. [16] showed that the minimax rate-optimal estimators introduced in [11] can lead to consistent and substantial performance boosts in various machine learning algorithms.


Recall that the minimax risk of estimating a functional $F(P)$ is defined via $\inf_{\hat F} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P (\hat F - F(P))^2$, where $\mathcal{M}_S$ denotes all distributions with alphabet size $S$, and the infimum is taken with respect to all estimators $\hat F$. Correspondingly, the maximum risk of the MLE $F(P_n)$, which evaluates the functional $F(\cdot)$ at the empirical distribution $P_n$, is defined via $\sup_{P \in \mathcal{M}_S} \mathbb{E}_P (F(P_n) - F(P))^2$. The following table in Jiao et al. [11] summarizes the minimax $L_2$ rates and the $L_2$ rates of the MLE in estimating $H(P)$ and $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$. Whenever there are two terms, the first term corresponds to squared bias, and the second term corresponds to variance. It is evident that one can obtain the minimax rates from the $L_2$ rates of the MLE via replacing $n$ with $n \ln n$ in the dominating (bias) terms. We adopt the following notation: $a_n \lesssim b_n$ means $\sup_n a_n/b_n < \infty$, $a_n \gtrsim b_n$ means $b_n \lesssim a_n$, and $a_n \asymp b_n$ means $a_n \lesssim b_n$ and $a_n \gtrsim b_n$, or equivalently, there exist two universal constants $c, C$ such that

$$0 < c < \liminf_{n \to \infty} \frac{a_n}{b_n} \le \limsup_{n \to \infty} \frac{a_n}{b_n} < C < \infty. \tag{2}$$



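Functional | Minimax $L_2$ rates | $L_2$ rates of MLE

$H(P)$ | $\frac{S^2}{(n \ln n)^2} + \frac{\ln^2 S}{n}$ $(n \gtrsim \frac{S}{\ln S})$ ([11], [15]) | $\frac{S^2}{n^2} + \frac{\ln^2 S}{n}$ $(n \gtrsim S)$ [14]

$F_\alpha(P)$, $0 < \alpha \le \frac{1}{2}$ | $\frac{S^2}{(n \ln n)^{2\alpha}}$ $(n \gtrsim S^{1/\alpha}/\ln S,\ \ln n \lesssim \ln S)$ ([11]) | $\frac{S^2}{n^{2\alpha}}$ $(n \gtrsim S^{1/\alpha})$ [14]

$F_\alpha(P)$, $\frac{1}{2} < \alpha < 1$ | $\frac{S^2}{(n \ln n)^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$ $(n \gtrsim S^{1/\alpha}/\ln S)$ ([11]) | $\frac{S^2}{n^{2\alpha}} + \frac{S^{2-2\alpha}}{n}$ $(n \gtrsim S^{1/\alpha})$ [14]

$F_\alpha(P)$, $1 < \alpha < \frac{3}{2}$ | $(n \ln n)^{-2(\alpha-1)}$ $(S \gtrsim n \ln n)$ ([11]) | $n^{-2(\alpha-1)}$ $(S \gtrsim n)$ [14]

$F_\alpha(P)$, $\alpha \ge \frac{3}{2}$ | $n^{-1}$ [14] | $n^{-1}$

TABLE I: Comparison of the minimax $L_2$ rates and the $L_2$ rates of the MLE in estimating $H(P)$ and $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$. Whenever there are two terms, the first term corresponds to squared bias, and the second term corresponds to variance. It is evident that one can obtain the minimax rates from the $L_2$ rates of the MLE via replacing $n$ with $n \ln n$ in the dominating (bias) terms.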


B. Refined minimaxity: adaptive estimation 


One concern the readers may have about results on minimax rates is that they are too pessimistic. Indeed, in the definition $\inf_{\hat F} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P (\hat F - F(P))^2$, we have considered the worst case distribution $P$ over all possible distributions supported on $S$ elements, and it would be disappointing if the estimator in Jiao et al. [11] turned out to behave sub-optimally when we consider distributions lying in subsets of $\mathcal{M}_S$. A usual approach to alleviate this concern is the adaptive estimation framework, which we briefly review below.

The primary approach to alleviate the pessimism of minimaxity in statistics is the construction of adaptive procedures, which has gained particular prominence in nonparametric statistics [17]. The goal of adaptive inference is to construct a single
procedure that achieves optimality simultaneously over a collection of parameter spaces. Informally, an adaptive procedure 
automatically adjusts to the unknown parameter, and acts as if it knows the parameter lies in a more restricted subset of the 
whole parameter space. A common way to evaluate such a procedure is to compare its maximum risk over each subset of the 
parameter space in the collection with the corresponding minimax risk. If they are nearly equal, then we say such a procedure 
is adaptive with respect to that collection of subsets of the parameter space. 

The primary results of this paper are twofold. 


1) First, we show that the minimax rate-optimal entropy estimator in Jiao et al. [11] is adaptive with respect to the collection of parameter spaces $\mathcal{M}_S(H)$, where $\mathcal{M}_S(H) = \{P : H(P) \le H, P \in \mathcal{M}_S\}$. Moreover, the estimator does not need to know $S$ or $H$, which is an advantage in practice since usually neither the alphabet size $S$ nor an a priori upper bound on the true entropy $H(P)$ is known.

2) Second, we show that the sample size enlargement effect still holds in this adaptive estimation scenario. Table I demonstrates that in estimating various functionals, the performance of the minimax rate-optimal estimator with $n$ samples is nearly that of the MLE with $n \ln n$ samples, which the authors termed "effective sample size enlargement" in [11]. We compute the maximum risk of the MLE over each $\mathcal{M}_S(H)$, and show that for every $H$, the performance of the estimator in [11] with $n$ samples is still nearly that of the MLE with $n \ln n$ samples.

These facts suggest that the estimator in Jiao et al. [11] is near optimal in a very strong sense. We refer the readers to [11] for a detailed discussion of the methodology behind their estimator, a literature survey, and experimental results.
C. Mathematical framework and estimator construction

Before we discuss the main results, we would like to recall the construction of the entropy estimator in [11]. The approach is to tackle the estimation problem separately for the cases of "small $p$" and "large $p$" in $H(P)$ estimation, corresponding to treating regions where the functional is "nonsmooth" and "smooth" in different ways. Specifically, after we obtain the empirical distribution $P_n$, for each coordinate $P_n(i)$, if $P_n(i) \lesssim \ln n/n$, we (i) compute the best polynomial approximation for $-p_i \ln p_i$ in the regime $0 \le p_i \lesssim \ln n/n$, (ii) use the unbiased estimators for integer powers $p_i^k$ to estimate the corresponding terms in the polynomial approximation for $-p_i \ln p_i$ up to order $K_n \sim \ln n$, and (iii) use that polynomial as an estimate for $-p_i \ln p_i$. If $P_n(i) \gtrsim \ln n/n$, we use the estimator $-P_n(i) \ln P_n(i) + \frac{1}{2n}$ to estimate $-p_i \ln p_i$. Then, we add the estimators corresponding to each coordinate.

We define the minimax risk for the Multinomial model with $n$ observations on alphabet size $S$ for estimating $H(P)$, $P \in \mathcal{M}_S(H)$, as

$$R(S, n, H) = \inf_{\hat H} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_{\text{Multinomial}} \left(\hat H - H(P)\right)^2, \tag{3}$$

which is the quantity we will characterize in this paper. To simplify the analysis, we also utilize the Poisson sampling model, i.e., we first draw a random variable $N \sim \mathrm{Poi}(n)$, and then obtain $N$ samples from the distribution $P$. It is equivalent to having an $S$-dimensional random vector $Z$ such that each component $Z_i$ in $Z$ has distribution $\mathrm{Poi}(np_i)$, and all coordinates of $Z$ are independent.

The counterpart of the minimax risk in the Poissonized model is defined as

$$R_P(S, n, H) = \inf_{\hat H} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_{\text{Poisson}} \left(\hat H - H(P)\right)^2. \tag{4}$$

The following lemma, which follows from [11], [15], shows that the minimax risks under the Multinomial model and the Poissonized model are essentially equivalent.

Lemma 1. The minimax risks under the Poissonized model and the Multinomial model are related via the following inequalities:

$$R_P(S, 2n, H) - e^{-n/4} H^2 \le R(S, n, H) \le 2 R_P(S, n/2, H). \tag{5}$$

For simplicity of analysis, we conduct the classical "splitting" operation [18] on the Poisson random vector $Z$, and obtain two independent identically distributed random vectors $X = [X_1, X_2, \ldots, X_S]^T$, $Y = [Y_1, Y_2, \ldots, Y_S]^T$, such that each component $X_i$ in $X$ has distribution $\mathrm{Poi}(np_i/2)$, and all coordinates in $X$ are independent. For each coordinate $i$, the splitting process generates a random variable $T_i$ such that $T_i | Z \sim \mathrm{B}(Z_i, 1/2)$, and assigns $X_i = T_i$, $Y_i = Z_i - T_i$. All the random variables $\{T_i : 1 \le i \le S\}$ are conditionally independent given our observation $Z$. We also note that for a random variable $X$ such that $nX \sim \mathrm{Poi}(np)$,

$$\mathbb{E} \prod_{r=0}^{k-1} \left(X - \frac{r}{n}\right) = p^k, \tag{6}$$

for any $k \in \mathbb{N}_+$.
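The unbiasedness identity (6) can be checked numerically. The sketch below sums the Poisson probability mass function directly, so no random sampling is involved; the helper names `falling_product` and `poisson_expectation`, and the values of $n$, $p$, $k$, are assumptions chosen only for illustration.

```python
import math

def falling_product(count, k, n):
    """prod_{r=0}^{k-1} (X - r/n) evaluated at X = count / n."""
    x = count / n
    out = 1.0
    for r in range(k):
        out *= x - r / n
    return out

def poisson_expectation(f, lam, terms=200):
    """E f(Z) for Z ~ Poi(lam), via a truncated sum over the pmf."""
    total, pmf = 0.0, math.exp(-lam)
    for z in range(terms):
        total += pmf * f(z)
        pmf *= lam / (z + 1)  # pmf(z + 1) = pmf(z) * lam / (z + 1)
    return total

n, p, k = 50, 0.1, 3  # hypothetical values
lhs = poisson_expectation(lambda z: falling_product(z, k, n), n * p)
# lhs agrees with p**k up to truncation and floating-point error.
```

This is what makes step (ii) of the construction possible: every monomial $p^k$ in a polynomial approximation of $-p \ln p$ has an exactly unbiased estimator under the Poissonized model.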

For simplicity, we re-define $n/2$ as $n$, and denote

$$\hat p_{i,1} = \frac{X_i}{n}, \quad \hat p_{i,2} = \frac{Y_i}{n}, \quad \Delta = \frac{c_1 \ln n}{n}, \quad K = c_2 \ln n, \quad t = \frac{\Delta}{4}, \tag{7}$$

where $c_1, c_2$ are positive parameters to be specified later. Note that $\Delta, K, t$ are functions of $n$, where we omit the subscript $n$ for brevity.

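The estimator $\hat H$ in Jiao et al. [11] is constructed as follows.

$$\hat H = \sum_{i=1}^{S} \left[ L_H(\hat p_{i,1}) \mathbb{1}(\hat p_{i,2} \le 2\Delta) + U_H(\hat p_{i,1}) \mathbb{1}(\hat p_{i,2} > 2\Delta) \right], \tag{8}$$

where

$$S_{K,H}(x) = \sum_{k=1}^{K} g_{k,H} (4\Delta)^{-k+1} \prod_{r=0}^{k-1} \left(x - \frac{r}{n}\right), \tag{9}$$

$$L_H(x) = \min\{S_{K,H}(x), 1\}, \tag{10}$$

$$U_H(x) = I_n(x) \left(-x \ln x + \frac{1}{2n}\right). \tag{11}$$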

We explain each equation in detail as follows.

1) Equation (8): Note that $\hat p_{i,1}$ and $\hat p_{i,2}$ are i.i.d. random variables such that $n \hat p_{i,1} \sim \mathrm{Poi}(np_i)$. We use $\hat p_{i,2}$ to determine whether we are operating in the "nonsmooth" regime or not. If $\hat p_{i,2} \le 2\Delta$, we declare we are in the "nonsmooth" regime, and plug $\hat p_{i,1}$ into the function $L_H(\cdot)$. If $\hat p_{i,2} > 2\Delta$, we declare we are in the "smooth" regime, and plug $\hat p_{i,1}$ into $U_H(\cdot)$.

2) Equation (9): The coefficients $r_{k,H}$, $0 \le k \le K$, are the coefficients of the best polynomial approximation of $-x \ln x$ over $[0, 1]$ up to degree $K$, i.e.,

$$\sum_{k=0}^{K} r_{k,H} x^k = \arg\min_{y(x) \in \mathrm{poly}_K} \sup_{x \in [0,1]} |y(x) - (-x \ln x)|, \tag{12}$$

where $\mathrm{poly}_K$ denotes the set of algebraic polynomials up to order $K$. Note that in general $r_{k,H}$ depends on $K$, which we do not make explicit for brevity.

Then we define $\{g_{k,H}\}_{1 \le k \le K}$ as

$$g_{k,H} = r_{k,H}, \ 2 \le k \le K, \quad g_{1,H} = r_{1,H} - \ln(4\Delta). \tag{13}$$

Lemma 9 shows that for $nX \sim \mathrm{Poi}(np)$,

$$\mathbb{E} S_{K,H}(X) = \sum_{k=1}^{K} g_{k,H} (4\Delta)^{-k+1} p^k \tag{14}$$

is a near-best polynomial approximation for $-p \ln p$ on $[0, 4\Delta]$. Thus, we can understand $S_{K,H}(X)$, $nX \sim \mathrm{Poi}(np)$, as a random variable whose expectation is nearly¹ the best approximation of the function $-x \ln x$ over $[0, 4\Delta]$.

3) Equation (10): Any reasonable estimator for $-p \ln p$ should be upper bounded by the value one. We cut off $S_{K,H}(x)$ by the upper bound 1, and define the function $L_H(x)$, which means "lower part".

4) Equation (11): The function $U_H(x)$ (meaning "upper part") is nothing but a product of an interpolation function $I_n(x)$ and the bias-corrected MLE. The interpolation function $I_n(x)$ is defined as follows:

$$I_n(x) = \begin{cases} 0 & x \le t \\ g(x - t; t) & t < x < 2t \\ 1 & x \ge 2t \end{cases} \tag{15}$$

The following lemma characterizes the properties of the function $g(x; a)$ appearing in the definition of $I_n(x)$. In particular, it shows that $I_n(x) \in C^4[0, 1]$.

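Lemma 2. For the function $g(x; a)$ on $[0, a]$ defined as follows,

$$g(x; a) \triangleq 126 \left(\frac{x}{a}\right)^5 - 420 \left(\frac{x}{a}\right)^6 + 540 \left(\frac{x}{a}\right)^7 - 315 \left(\frac{x}{a}\right)^8 + 70 \left(\frac{x}{a}\right)^9, \tag{16}$$

we have the following properties:

$$g(0; a) = 0, \quad g^{(i)}(0; a) = 0, \ 1 \le i \le 4 \tag{17}$$

$$g(a; a) = 1, \quad g^{(i)}(a; a) = 0, \ 1 \le i \le 4 \tag{18}$$

The function $g(x; 1)$ is depicted in Figure 1.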
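The properties in Lemma 2 are easy to check numerically. The sketch below implements $g(x; a)$ from (16) and the interpolation function $I_n$ from (15), here called `interpolation`; the function names and test points are assumptions for illustration. Differentiating (16) term by term gives $g'(x; a) = \frac{630}{a} u^4 (1-u)^4$ with $u = x/a$, which makes the vanishing of the first four derivatives at both endpoints visible; the code compares this closed form against a finite difference.

```python
def g(x, a):
    """The degree-9 polynomial of Lemma 2, evaluated at u = x / a."""
    u = x / a
    return 126 * u**5 - 420 * u**6 + 540 * u**7 - 315 * u**8 + 70 * u**9

def g_prime(x, a):
    """Closed form of dg/dx: (630 / a) * u^4 * (1 - u)^4."""
    u = x / a
    return 630 * u**4 * (1 - u) ** 4 / a

def interpolation(x, t):
    """I_n(x): 0 below t, 1 above 2t, smooth transition g(x - t; t) in between."""
    if x <= t:
        return 0.0
    if x >= 2 * t:
        return 1.0
    return g(x - t, t)

h = 1e-6
numeric_slope = (g(0.3 + h, 1.0) - g(0.3 - h, 1.0)) / (2 * h)  # central difference
```

Because $u^4(1-u)^4$ is symmetric about $u = 1/2$, the transition is also symmetric: $g(a/2; a) = 1/2$.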

¹ Note that we have removed the constant term from the best polynomial approximation. This is to ensure that we assign zero to symbols we do not see.





II. Main Results

Since $\sup_{P \in \mathcal{M}_S} H(P) = \ln S$, we assume throughout this paper that $0 < H \le \ln S$. Denote by $\mathcal{M}_S(H)$ the set of all discrete probability distributions $P$ with support size $|\mathrm{supp}(P)| = S$ and entropy $H(P) \le H$. We say an estimator $\hat H = \hat H(Z)$ is within accuracy $\epsilon > 0$ if and only if

$$\sup_{P \in \mathcal{M}_S(H)} \left( \mathbb{E}_P |\hat H - H(P)|^2 \right)^{1/2} \le \epsilon. \tag{19}$$

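For the plug-in estimator $H(P_n)$, the following theorem presents the non-asymptotic upper and lower bounds for the $L_2$ risk.

Theorem 1. If $H \ge H_0 > 0$, where $H_0$ is a universal positive constant, then for the plug-in estimator $H(P_n)$, we have

$$\sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |H(P_n) - H(P)|^2 \asymp \begin{cases} \frac{S^2}{n^2} + \frac{H \ln S}{n} & \text{if } S \ln S \le enH, \\ \left[ \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH}\right) \right]^2 & \text{otherwise.} \end{cases} \tag{20}$$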


Note that the only assumption in Theorem 1 is that the upper bound $H$ should be no smaller than a constant, which is a reasonable assumption to avoid the subtle case where the naive zero estimator $\hat H \equiv 0$ has a satisfactory performance. The minimum sample complexity of the plug-in approach can be immediately obtained from Theorem 1.


Corollary 1. If $H \ge H_0 > 0$, where $H_0$ is a universal positive constant, the plug-in estimator $H(P_n)$ is within accuracy $\epsilon$ if and only if $n \gtrsim \frac{1}{H} S^{1 - \epsilon/H} \ln S$.


Recall that the MLE requires $n \gtrsim S$ samples to achieve a fixed accuracy $\epsilon$ when there is no constraint on the entropy [11]. Hence, when the upper bound on the entropy is loose, i.e., $H \asymp \ln S$, the minimum sample complexity in the bounded entropy case is exactly the same, i.e., we cannot essentially improve the estimation performance. On the other hand, when the upper bound is tight, i.e., $H \ll \ln S$, the required sample complexity enjoys a significant reduction, i.e., we only need a sublinear number of samples for accurate entropy estimation.

When it comes to the maximum $L_2$ risk, we conclude from Theorem 1 that the bounded entropy property helps only at the boundary, i.e., when $n$ is close to $S$ and $H$ is small. Moreover, this help vanishes quickly as $S$ increases: when $n = S^{1-\delta}$, the maximum $L_2$ risk will be at the order $(\delta H)^2$, which is the same risk achieved by the naive zero estimator when $\delta$ is not close to zero.

Is the plug-in estimator $H(P_n)$ optimal in the minimax sense? It has been shown in [11], [12], [15] that when there is no constraint on $H(P)$, i.e., $H = \ln S$, the answer is negative. What about subsets of $\mathcal{M}_S$, such as $\mathcal{M}_S(H)$? The following theorem characterizes the minimax $L_2$ rates over $\mathcal{M}_S(H)$.


Theorem 2. If $H \ge H_0 > 0$, where $H_0$ is a universal positive constant, then

$$\inf_{\hat H} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |\hat H - H(P)|^2 \asymp \begin{cases} \frac{S^2}{(n \ln n)^2} + \frac{H \ln S}{n} & \text{if } S \ln S \le enH \ln n, \\ \left[ \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH \ln n}\right) \right]^2 & \text{otherwise,} \end{cases} \tag{21}$$

where the infimum is taken over all possible estimators. Moreover, the upper bound is achieved by the estimator in [11] under the Poissonized model without the knowledge of $H$ or $S$.


An immediate result on the sample complexity is as follows. 

Corollary 2. If $H \ge H_0 > 0$, where $H_0$ is a universal positive constant, the minimax rate-optimal estimator in [11] is within accuracy $\epsilon$ if and only if $n \gtrsim \frac{1}{H} S^{1 - \epsilon/H}$.


For the minimum sample complexity, we still distinguish $H$ into two cases. Firstly, when $H \asymp \ln S$, the required sample complexity is $n \gtrsim \frac{S}{\ln S}$, which recovers the minimax results with no constraint on entropy in [11]. Secondly, when $H \ll \ln S$, there is a significant improvement.

We also conclude from Theorem 2 that the bounded entropy constraint again helps only at the boundary, and this help vanishes quickly as $S$ increases: when $n = S^{1-\delta}$, we do not have sufficient information to make inference, and the naive zero estimator is near-minimax.

To sum up, we have obtained the following conclusions. 


1) The minimax rate-optimal entropy estimator in Jiao et al. [11] is adaptive with respect to the collection of parameter spaces $\mathcal{M}_S(H)$, where $\mathcal{M}_S(H) = \{P : H(P) \le H, P \in \mathcal{M}_S\}$. Moreover, the estimator does not need to know $S$ or $H$, which is an advantage in practice since usually neither the alphabet size $S$ nor an a priori upper bound on the true entropy $H(P)$ is known.

2) The sample size enlargement effect still holds in this adaptive estimation scenario. Table I demonstrates that in estimating various functionals, the performance of the minimax rate-optimal estimator with $n$ samples is essentially that of the MLE with $n \ln n$ samples, which the authors termed "sample size enlargement" in [11]. Theorems 1 and 2 show that over every $\mathcal{M}_S(H)$, the performance of the estimator in [11] with $n$ samples is still essentially that of the MLE with $n \ln n$ samples.
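The sample size enlargement can be illustrated numerically. The sketch below encodes the two rate expressions with all universal constants dropped, in the form they are derived in Sections III and V (bias $S/n$ versus $S/(n \ln n)$, variance $H \ln S/n$, and the logarithmic regime otherwise); the function names and the values of $S$, $n$, $H$ are assumptions chosen so that both estimators are in the undersampled regime. It illustrates orders of magnitude only and makes no statement about actual risks.

```python
import math

def mle_rate(S, n, H):
    """Order of the maximum L2 risk of the plug-in estimator (constants dropped)."""
    if S * math.log(S) <= math.e * n * H:
        return (S / n) ** 2 + H * math.log(S) / n
    return (H / math.log(S) * math.log(S * math.log(S) / (n * H))) ** 2

def minimax_rate(S, n, H):
    """Order of the minimax L2 risk: n becomes n ln n in the bias-driven terms."""
    m = n * math.log(n)
    if S * math.log(S) <= math.e * m * H:
        return (S / m) ** 2 + H * math.log(S) / n
    return (H / math.log(S) * math.log(S * math.log(S) / (m * H))) ** 2

S, n, H = 10**6, 10**3, 2.0  # hypothetical values with S ln S > e n H ln n
with_n = minimax_rate(S, n, H)
mle_with_n_ln_n = mle_rate(S, n * math.log(n), H)
```

In the undersampled regime the two expressions coincide exactly after the substitution $n \mapsto n \ln n$, while in the other regime only the squared bias term is affected and the variance term $H \ln S / n$ is unchanged.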


III. Proof of Upper Bounds in Theorem 1

First we consider the case where $S \ln S \le enH$. For the bias, it has been shown in [13] that

$$|\mathrm{Bias}(H(P_n))| \le \ln\left(1 + \frac{S-1}{n}\right) \le \frac{S}{n}. \tag{22}$$

As for the variance, [14] shows by the Efron-Stein inequality that

$$\mathrm{Var}(H(P_n)) \le \frac{2}{n} \sum_{i=1}^{S} p_i (\ln p_i - 2)^2 \le \frac{2}{n} \left( \sum_{i=1}^{S} p_i (\ln p_i)^2 + 4H + 4 \right). \tag{23}$$

Lemma 3. For any discrete distribution $P = (p_1, p_2, \cdots, p_S)$ with alphabet size $S \ge 2$, we have

$$\sum_{i=1}^{S} p_i (\ln p_i)^2 \le 2 \ln S \cdot \left( \sum_{i=1}^{S} -p_i \ln p_i \right) + 3. \tag{24}$$

In light of Lemma 3, we conclude that

$$\mathrm{Var}(H(P_n)) \le \frac{2}{n} \left( \sum_{i=1}^{S} p_i (\ln p_i)^2 + 4H + 4 \right) \le \frac{2}{n} \left( 2H \ln S + 4H + 7 \right) \lesssim \frac{H \ln S}{n}, \tag{25}$$

where we have used the assumption $H \ge H_0 > 0$ in the last step.

Hence, when $S \ln S \le enH$, we have

$$\mathbb{E}_P (H(P_n) - H(P))^2 = (\mathrm{Bias}(H(P_n)))^2 + \mathrm{Var}(H(P_n)) \lesssim \frac{S^2}{n^2} + \frac{H \ln S}{n}, \tag{26}$$

which completes the proof for the first part. For the second part, we introduce a lemma first.
Lemma 4. For p < U— and np ~ B (n,p), where c is a positive integer, we have 


i n 

r Al cine / enp\ c v-a 

0 < — pmp — E[— pmp\ < — pmynp) +pmc -f-^- J + 2_^ 


/enp\ 

| C , P lnfc + l 

/ enp\ 

V c ) 

k=c-\- 1 

l k ) 


( 22 ) 


(23) 


(24) 


(25) 


(26) 


n 


(27) 
Define $\xi(\hat p_i) = -\hat p_i \ln \hat p_i$. In light of Lemma 4, for any positive integer $c$, we have




< 

E 

-Pi ^(np*) + 


l: Pi ^ 2en 

L 


< 

E 

-Pi In 

mpi y 


‘■Pi< 2^ 



< 

E 

-Pi In 

<np*y 





< 

E 

-Pi In 

<np 4 y 


‘■Pi< 2^ 



< 

E 

~Pi In 

<np 4 y 






clnc ( enpi 


(=T+ E 


In k + 1 / enpi \ k 


n 


i-Pi< 21k L 


k=c -\-1 

c In c / enpi \ c In k + 1 / enpi \ k 


(=*) + E 


n 


m 
m 


clnc 2en __ 

-•- 2 + \ 

n c n 

fe=C+l 


fc=c+l 

In k + 1 2 en ( c ' k 

c 


a)' 


k=c -\-1 


In k + 1 
k2 k 


(28) 

(29) 

(30) 

(31) 

(32) 


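where we have used the convexity of $x^k$, $0 \le x \le 1$, for any $k \ge 1$ in (30). We consider the following optimization problem:

$$\text{maximize} \sum_{i: p_i \le \frac{c}{2en}} -p_i \ln\left(\frac{np_i}{c}\right) \quad \text{subject to} \sum_{i: p_i \le \frac{c}{2en}} -p_i \ln p_i \le H, \quad A_1 \triangleq \left| \left\{ i : p_i \le \frac{c}{2en} \right\} \right| \le S. \tag{33}$$

It is straightforward to show that in the solution to (33), all $p_i \le c/(2en)$ should be equal, say, to $p_0$. Then (33) reduces to

$$\text{maximize } A_1 p_0 \ln\left(\frac{c}{np_0}\right) \quad \text{subject to } 0 \le p_0 \le \frac{c}{2en}, \ A_1 \le S, \ -A_1 p_0 \ln p_0 \le H, \tag{34}$$

whose optimization result is no larger than

$$\text{maximize } A_1 p_0 \ln\left(\frac{c}{np_0}\right) \quad \text{subject to } A_1 \le S, \ -A_1 p_0 \ln p_0 \le H. \tag{35}$$

Then it is easy to check that the solution to (35) is $A_1 = S$ and $-A_1 p_0 \ln p_0 = H$. Then we have $p_0 \asymp \frac{H}{S \ln S}$, and

$$\sum_{i: p_i \le \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat p_i))| \le S p_0 \ln\left(\frac{c}{np_0}\right) + 2e \cdot 2^{-c} (\ln c + 1) \tag{36}$$

$$\lesssim \frac{H}{\ln S} \ln\left(\frac{cS \ln S}{nH}\right) + 2^{-c} \ln c. \tag{37}$$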

Now we set c = n« with 


H , /SlnS 
e = -—- In 


> 0 


In S' \ nH 

and we assume without loss of generality that c is an integer. We can easily check that 

H In c In n H (S In S \ 

2 1 " c = 5 w = i^ e =i^ 1 "Ufl-j 


which leads to the desired result 


v-^ in- wi H ( SlnS\ 




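As for the second part of the bias, it has been shown in [11] that $|\mathrm{Bias}(\xi(\hat p_i))| \le \frac{5 \ln 2}{n}$ holds for all $i$, hence

$$\sum_{i: p_i > \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat p_i))| \le \frac{5 \ln 2}{n} \cdot \left| \left\{ i : p_i > \frac{c}{2en} \right\} \right| \triangleq \frac{5 \ln 2}{n} \cdot A_2.$$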


(34) 

(35) 

(36) 

(37) 

(38) 

(39) 

(40) 

(41) 


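We use the bounded entropy property to bound $A_2$. Due to the concavity of $-x \ln x$, $0 \le x \le 1$, the minimum of $\sum_{i: p_i > \frac{c}{2en}} -p_i \ln p_i$ is attained when all but one $p_i$ are at the boundary $p_i = \frac{c}{2en}$, hence

$$H \ge \sum_{i=1}^{S} -p_i \ln p_i \ge \sum_{i: p_i > \frac{c}{2en}} -p_i \ln p_i \ge (A_2 - 1) \cdot \frac{c}{2en} \ln\left(\frac{2en}{c}\right). \tag{42}$$

As a result, we have $A_2 \lesssim \frac{nH}{\ln n}$, and

$$\sum_{i: p_i > \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat p_i))| \lesssim \frac{H}{\ln n} \lesssim \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH}\right) \tag{43}$$

when $S \ln S > enH$ (the last inequality can be shown by considering the two cases $S \ge n^2$ and $S < n^2$ separately). Hence,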


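$$|\mathrm{Bias}(H(P_n))| \le \sum_{i: p_i \le \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat p_i))| + \sum_{i: p_i > \frac{c}{2en}} |\mathrm{Bias}(\xi(\hat p_i))| \lesssim \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH}\right),$$

and the squared bias is the dominating term since

$$\left[ \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH}\right) \right]^2 \gtrsim \frac{(\ln n)^2}{n},$$

where $\frac{(\ln n)^2}{n}$ is an upper bound for the variance $\mathrm{Var}(H(P_n))$ [13].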


IV. Proof of Lower Bounds in Theorem 1

We first derive a lower bound for the bias term. When $S \ln S \le enH$, we recall the following result in [14].

Lemma 5. For $p \in [\frac{15}{n}, 1]$, we have

$$-p \ln p - \mathbb{E}[-\hat p \ln \hat p] \ge \frac{1-p}{2n} + \frac{1}{20 n^2 p} - \frac{p}{12 n^2}. \tag{44}$$


If we choose P , 1 — AK q • • • ,0), where 

Vn’n 5 ’ n ’ n ’ ’ 5 /’ 


= min U £j, S- 1, max {iV e N : In (if) - (l - if) In (l - if | < P 


we have 


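$$|\mathrm{Bias}(H(P_n))| \ge K \cdot \left( \frac{n - 15}{2n^2} + \frac{1}{300n} - \frac{5}{4n^3} \right) \gtrsim \min\left\{ 1, \frac{S}{n}, \frac{H}{\ln n} \right\} \gtrsim \frac{S}{n}, \tag{47}$$

where we have used the assumption $S \ln S \le enH$ and $H \ge H_0 > 0$. Hence, we have proved that

$$\sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |H(P_n) - H(P)|^2 \gtrsim \frac{S^2}{n^2}. \tag{48}$$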

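For the case where $S \ln S > enH$, we establish a lemma first.

Lemma 6. For $n \hat p \sim \mathrm{B}(n, p)$, we have

$$-p \ln p - \mathbb{E}[-\hat p \ln \hat p] \ge -p \ln(np). \tag{49}$$

Proof: Since $n \hat p$ is an integer, we have $\hat p \ln(n \hat p) \ge 0$, and thus

$$\mathbb{E}[-\hat p \ln \hat p] \le \mathbb{E}[\hat p \ln n] = p \ln n. \tag{50}$$

Consider a distribution $P = \left( \frac{\Delta}{S-1}, \cdots, \frac{\Delta}{S-1}, 1 - \Delta \right) \in \mathcal{M}_S(H)$, where $\Delta \in (0, 1)$ is the solution to

$$-\Delta \ln \frac{\Delta}{S-1} - (1 - \Delta) \ln(1 - \Delta) = H, \tag{52}$$

then Lemma 6 tells us that

$$|\mathrm{Bias}(H(P_n))| \ge \sum_{i=1}^{S-1} -p_i \ln(np_i) = -\Delta \ln\left(\frac{n\Delta}{S-1}\right) \gtrsim \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH}\right), \tag{53}$$

where in the last step we have used the relationship $\Delta \asymp \frac{H}{\ln S}$ from (52).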

We now turn to the lower bound for variance. We will actually prove a stronger result: a minimax lower bound for all estimators for the $L_2$ risk, which naturally is also a lower bound for the maximum risk of the MLE. We use Le Cam's two-point method here. Suppose we observe a random vector $Z \in (\mathcal{Z}, \mathcal{A})$ which has distribution $P_\theta$ where $\theta \in \Theta$. Let $\theta_0$ and $\theta_1$ be two elements of $\Theta$. Let $\hat T = \hat T(Z)$ be an arbitrary estimator of a function $T(\theta)$ based on $Z$. We have the following general minimax lower bound.
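Lemma 7. [19, Sec. 2.4.2] Denoting the Kullback-Leibler divergence between $P$ and $Q$ by

$$D(P \| Q) = \begin{cases} \int \ln\left(\frac{dP}{dQ}\right) dP, & \text{if } P \ll Q, \\ +\infty, & \text{otherwise,} \end{cases} \tag{54}$$

we have

$$\inf_{\hat T} \sup_{\theta \in \Theta} P_\theta \left( |\hat T - T(\theta)| \ge \frac{|T(\theta_1) - T(\theta_0)|}{2} \right) \ge \frac{1}{4} \exp\left(-D(P_{\theta_1} \| P_{\theta_0})\right). \tag{55}$$

Applying this lemma to the Poissonized model $n \hat p_i \sim \mathrm{Poi}(np_i)$, $1 \le i \le S$, we know that for $\theta_1 = (p_1, p_2, \cdots, p_S)$, $\theta_0 = (q_1, q_2, \cdots, q_S)$,

$$D(P_{\theta_1} \| P_{\theta_0}) = \sum_{i=1}^{S} D(\mathrm{Poi}(np_i) \| \mathrm{Poi}(nq_i)) = \sum_{i=1}^{S} \sum_{k=0}^{\infty} \mathbb{P}(\mathrm{Poi}(np_i) = k) \cdot k \ln \frac{p_i}{q_i} = \sum_{i=1}^{S} np_i \ln \frac{p_i}{q_i} = n D(\theta_1 \| \theta_0), \tag{56}$$

then Markov's inequality yields

$$\inf_{\hat H} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P \left( \hat H - H(P) \right)^2 \ge \frac{|H(\theta_1) - H(\theta_0)|^2}{4} \cdot \inf_{\hat H} \sup_{P \in \mathcal{M}_S(H)} \mathbb{P} \left( |\hat H - H(P)| \ge \frac{|H(\theta_1) - H(\theta_0)|}{2} \right) \tag{57}$$

$$\ge \frac{|H(\theta_1) - H(\theta_0)|^2}{16} \exp\left(-n D(\theta_1 \| \theta_0)\right). \tag{58}$$

Fix $\epsilon \in (0, 1)$ to be specified later, and let

$$\theta_1 = \left( \frac{\Delta}{S-1}, \cdots, \frac{\Delta}{S-1}, 1 - \Delta \right), \quad \theta_0 = \left( \frac{\Delta(1-\epsilon)}{S-1}, \cdots, \frac{\Delta(1-\epsilon)}{S-1}, 1 - \Delta + \Delta\epsilon \right), \tag{59}$$

where $\Delta$ is the solution to (52). Direct computation yields

$$D(\theta_1 \| \theta_0) = \Delta \ln \frac{1}{1 - \epsilon} + (1 - \Delta) \ln \frac{1 - \Delta}{1 - \Delta + \Delta\epsilon} \equiv h(\epsilon), \tag{60}$$

and we have $h(0) = h'(0) = 0$, and $|h''(0)| = \frac{\Delta}{1 - \Delta} > 0$. Hence, for $\epsilon$ small enough we have $D(\theta_1 \| \theta_0) \le \epsilon^2 \Delta$. By choosing $\epsilon = (n\Delta)^{-1/2} \ll 1$, we have

$$|H(\theta_1) - H(\theta_0)| = \left| -\Delta \ln \frac{\Delta}{S-1} + \Delta(1 - \epsilon) \ln \frac{\Delta(1 - \epsilon)}{S-1} - (1 - \Delta) \ln(1 - \Delta) + (1 - \Delta + \Delta\epsilon) \ln(1 - \Delta + \Delta\epsilon) \right| \tag{61}$$

$$\gtrsim \Delta \epsilon \ln \frac{S-1}{\Delta}. \tag{62}$$

Hence, by Lemma 7 we know that

$$\inf_{\hat H} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P |\hat H - H(P)|^2 \gtrsim \left( \Delta \epsilon \ln \frac{S-1}{\Delta} \right)^2 = \frac{\Delta}{n} \left( \ln \frac{S-1}{\Delta} \right)^2 \asymp \frac{H \ln S}{n}, \tag{63}$$

and the lower bound in the Multinomial model follows from Lemma 1.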
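The two-point construction can be evaluated numerically. The sketch below solves (52) for $\Delta$ by bisection, sets $\epsilon = (n\Delta)^{-1/2}$, and evaluates the bound $\frac{1}{16} |H(\theta_1) - H(\theta_0)|^2 \exp(-n D(\theta_1 \| \theta_0))$. The helper names and the values of $S$, $n$, $H$ are assumptions for illustration; the computation only shows that the resulting bound is of the order $H \ln S / n$ for one hypothetical configuration.

```python
import math

def two_level_entropy(delta, S):
    """Entropy of (delta/(S-1), ..., delta/(S-1), 1 - delta)."""
    return -delta * math.log(delta / (S - 1)) - (1 - delta) * math.log(1 - delta)

def solve_delta(H, S, iters=200):
    """Bisection on (0, 1 - 1/S), where the two-level entropy is increasing."""
    lo, hi = 0.0, 1.0 - 1.0 / S
    for _ in range(iters):
        mid = (lo + hi) / 2
        if two_level_entropy(mid, S) < H:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def two_point_bound(S, n, H):
    delta = solve_delta(H, S)
    eps = (n * delta) ** -0.5
    # KL divergence between theta_1 and theta_0, i.e. h(eps).
    kl = (delta * math.log(1 / (1 - eps))
          + (1 - delta) * math.log((1 - delta) / (1 - delta + delta * eps)))
    # theta_0 is the same two-level shape with mass delta * (1 - eps) on rare symbols.
    gap = two_level_entropy(delta, S) - two_level_entropy(delta * (1 - eps), S)
    return delta, n * kl, gap**2 / 16 * math.exp(-n * kl)

S, n, H = 10**4, 10**5, 2.0  # hypothetical values
delta, n_kl, lower = two_point_bound(S, n, H)
target = H * math.log(S) / n  # the variance term in Theorems 1 and 2
```

With this choice of $\epsilon$ the exponent $n D(\theta_1 \| \theta_0)$ stays bounded by a constant, so the exponential factor does not degrade the bound.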

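V. Proof of Upper Bounds in Theorem 2

Define

$$\xi = \xi(X, Y) = L_H(X) \mathbb{1}(Y \le 2\Delta) + U_H(X) \mathbb{1}(Y > 2\Delta), \tag{64}$$

where $nX \overset{D}{=} nY \sim \mathrm{Poi}(np)$, and $X, Y$ are independent. We first recall the following lemma from [11].

Lemma 8. Suppose $0 < c_1 = 16(1 + \delta)$, $0 < 8 c_2 \ln 2 = \epsilon < 1$, $\delta > 0$. Then the bias and variance of $\xi(X, Y)$ are bounded as follows:

$$|\mathrm{Bias}(\xi)| \lesssim \frac{1}{n \ln n}, \tag{65}$$

$$\mathrm{Var}(\xi) \lesssim \frac{(\ln n)^4}{n^{2-\epsilon}} + \frac{p (\ln p)^2}{n}. \tag{66}$$

In light of Lemma 8, we have

$$|\mathrm{Bias}(\hat H)| \le \sum_{i=1}^{S} |\mathrm{Bias}(\xi(\hat p_{i,1}, \hat p_{i,2}))| \lesssim \sum_{i=1}^{S} \frac{1}{n \ln n} = \frac{S}{n \ln n}, \tag{67}$$

$$\mathrm{Var}(\hat H) = \sum_{i=1}^{S} \mathrm{Var}(\xi(\hat p_{i,1}, \hat p_{i,2})) \lesssim \sum_{i=1}^{S} \left( \frac{(\ln n)^4}{n^{2-\epsilon}} + \frac{p_i (\ln p_i)^2}{n} \right) \lesssim \frac{S (\ln n)^4}{n^{2-\epsilon}} + \frac{H \ln S}{n}, \tag{68}$$

where we have used Lemma 3 in the last step. Hence,

$$\mathbb{E}_P \left( \hat H - H(P) \right)^2 = |\mathrm{Bias}(\hat H)|^2 + \mathrm{Var}(\hat H) \lesssim \frac{S^2}{(n \ln n)^2} + \frac{S (\ln n)^4}{n^{2-\epsilon}} + \frac{H \ln S}{n}. \tag{69}$$

When $S \ln S \le enH \ln n$, for $\epsilon$ small enough, say, $\epsilon < \frac{1}{2}$, we have

$$\frac{S (\ln n)^4}{n^{2-\epsilon}} \lesssim \sqrt{\frac{S^2}{(n \ln n)^2} \cdot \frac{H \ln S}{n}} \le \frac{S^2}{(n \ln n)^2} + \frac{H \ln S}{n}, \tag{70}$$

where we have used the assumption that $H \ge H_0 > 0$. Hence, the term $\frac{S (\ln n)^4}{n^{2-\epsilon}}$ is negligible when compared with the others, and we have reached the end for the case $S \ln S \le enH \ln n$.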

For the case where $S \ln S > enH \ln n$, we need stronger results for the bias and variance in the regime where $p \le \frac{1}{n \ln n}$. The results are summarized in the following lemma.


Lemma 9. If 0 < c 2 < 1 < Ci, for nX ~ Poi(np), 0 < p < en \ nn , we have 

|ESp- : p(X)+plnp| < -~pln(pnlnn) + (D p -\-ln(Aci/c\)) p 


n 


(71) 

(72) 


where the constant D p is given in Lemma 15 


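Using the Poisson tail bound (cf. Lemma 17) and a similar argument to [11, Lem. 8], we have the following lemma.

Lemma 10. Suppose $0 < c_1 = 16(1 + \delta)$, $0 < 10 c_2 \ln 2 = \epsilon < 1$, $\delta > 0$. Then for $0 \le p \le \frac{1}{en \ln n}$, we have

$$|\mathrm{Bias}(\xi)| \lesssim -p \ln(pn \ln n), \tag{73}$$

$$\mathrm{Var}(\xi) \lesssim \frac{(\ln n)^4 p}{n^{1-\epsilon}}. \tag{74}$$

Now we proceed to bound the total bias and variance. Consider the maximization problem

$$\text{maximize} \sum_{i: p_i \le \frac{1}{en \ln n}} -p_i \ln(p_i n \ln n) \quad \text{subject to} \sum_{i: p_i \le \frac{1}{en \ln n}} -p_i \ln p_i \le H, \quad \left| \left\{ i : p_i \le \frac{1}{en \ln n} \right\} \right| \le S. \tag{75}$$

Using similar arguments to (33), all $p_i \le \frac{1}{en \ln n}$ should be equal and both equalities in the constraints hold. As a result, we have

$$\sum_{i: p_i \le \frac{1}{en \ln n}} |\mathrm{Bias}(\xi)| \lesssim \sum_{i: p_i \le \frac{1}{en \ln n}} -p_i \ln(p_i n \ln n) \lesssim \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH \ln n}\right). \tag{76}$$

For symbols with $p_i > \frac{1}{en \ln n}$, similar arguments to those in the MLE analysis yield

$$\left| \left\{ i : p_i > \frac{1}{en \ln n} \right\} \right| \lesssim \frac{nH \ln n}{\ln(n \ln n)} \asymp nH, \tag{77}$$

hence

$$\sum_{i: p_i > \frac{1}{en \ln n}} |\mathrm{Bias}(\xi)| \lesssim \frac{1}{n \ln n} \cdot \left| \left\{ i : p_i > \frac{1}{en \ln n} \right\} \right| \lesssim \frac{H}{\ln n} \lesssim \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH \ln n}\right) \tag{78}$$

when $S \ln S > enH \ln n$. Summing up the bias yields

$$|\mathrm{Bias}(\hat H)| \le \sum_{i: p_i \le \frac{1}{en \ln n}} |\mathrm{Bias}(\xi(\hat p_{i,1}, \hat p_{i,2}))| + \sum_{i: p_i > \frac{1}{en \ln n}} |\mathrm{Bias}(\xi(\hat p_{i,1}, \hat p_{i,2}))| \lesssim \frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH \ln n}\right). \tag{79}$$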
(79) 
For the total variance, we have

Var (H)= Var (£(PM;Pi,2)) + E Var (^(Pi,i>Pi, 2 )) 

(Inn ) 4 p(lnp) 2 ^ ( ^ (In n) 4 p 

n 2 ~ e n 


=5 E 

i-Pi>TTT7 


E 


~< 


-< 


(Inn ) 5 (lnn)^ 


n 
H 
h VS 


1—e 


In 


PlnS 

nif Inn 


(Inn ) 4 

-n 1 — e 


1 2 


(80) 

(81) 

(82) 

(83) 


where in the last step we have used the assumption $H \ge H_0 > 0$ again. Combining the total bias and variance constitutes a complete proof of the upper bounds in Theorem 2.

VI. Proof of Lower Bounds in Theorem 2

When $S \ln S \le enH \ln n$, the lower bound for the squared bias, i.e., the $\frac{S^2}{(n \ln n)^2}$ term, can be obtained using a similar argument to that in [15]. Specifically, we can assign two product measures $\mu_0^N$ and $\mu_1^N$ to the first $N (\le S)$ components in the distribution vector $P$, where


supp(fq) = {0} U 


1 a 2 In n 
ainlnn n 


1 = 0,1 


for some constants a \, 02 > 0 , and 


In particular, 


tni{dt) = 


i = 0 , 1 . 


ainlnn’ 

j — t hitni(dt) — f — tlntn 0 (dt) y 
J 0 Jo 


dni 


and 


inf sup E P (h — H(P)\ y N ( f —t\nt^\{dt)— t — t\nt/j,o(dt) 
H PeMs ^ ' Y \J 0 Jo 


y 


N 2 


(nlnn) 2 

In [15], $N = S$. However, in our case, we have an additional constraint that $H(P) \le H$. Since

ailn(ainlnn) 1 


E Mi [—plnp]= / — tlnt/j,i(dt) < ln(ainlnn) / tni{dt) = 
Jo Jo 


1 In 1 


we have 


= NE^l-plnp] 


N 


(84) 


(85) 


( 86 ) 


(87) 


( 88 ) 


(89) 


One can show that the measures $\mu_i^N$, $i = 0, 1$, are highly concentrated around their expectations [15]. Hence, in order to ensure $H(P) \le H$ with overwhelming probability, we can set $N \asymp \min\{nH, S\}$, and the conditions $S \ln S \le enH \ln n$ and $H \ge H_0 > 0$ yield that $N \gtrsim S$. Hence,

$$\inf_{\hat H} \sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_P \left( \hat H - H(P) \right)^2 \gtrsim \frac{N^2}{(n \ln n)^2} \gtrsim \frac{S^2}{(n \ln n)^2}. \tag{90}$$

The variance bound $\frac{H \ln S}{n}$ has been given in (63), and so far we have completed the proof of the first part. As for the second part, the key lemma we will employ is the so-called method of two fuzzy hypotheses presented in Tsybakov [19]. Below we briefly review this general minimax lower bound.

Suppose we observe a random vector $Z \in (\mathcal{Z}, \mathcal{A})$ which has distribution $P_\theta$ where $\theta \in \Theta$. Let $\sigma_0$ and $\sigma_1$ be two prior distributions supported on $\Theta$. Write $F_i$ for the marginal distribution of $Z$ when the prior is $\sigma_i$ for $i = 0, 1$. For any function $g$ we shall write $\mathbb{E}_{F_i} g(Z)$ for the expectation of $g(Z)$ with respect to the marginal distribution of $Z$ when the prior on $\theta$ is $\sigma_i$. We shall write $\mathbb{E}_\theta g(Z)$ for the expectation of $g(Z)$ under $P_\theta$. Let $\hat T = \hat T(Z)$ be an arbitrary estimator of a function $T(\theta)$ based on $Z$. We have the following general minimax lower bound.
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Lemma 11. [19 Thm. 2.15] Gi\en the setting above, suppose there exist (€ R,s > 0,0 < Poifii < 1 such that 

ao(0 : T(0) < C - s) > 1 - A, 

<ri(0:T(0)>C + s)>l-ft. 

If V(Fi, Fq) < rj < 1, then 


inf sup I 
t see 


T — T(9)\ > s) > 


1-ri-0o- Pi 


(91) 

(92) 

(93) 


where Fi,i = 0,1 are the marginal distributions of Z when the priors are Oi,i = 0,1, respectively. 

Here $V(P, Q)$ is the total variation distance between two probability measures $P, Q$ on the measurable space $(\mathcal{Z}, \mathcal{A})$. Concretely, we have

$$V(P, Q) = \sup_{A \in \mathcal{A}} |P(A) - Q(A)| = \frac{1}{2} \int |p - q|\,d\nu, \tag{94}$$

where $p = \frac{dP}{d\nu}$, $q = \frac{dQ}{d\nu}$, and $\nu$ is a dominating measure so that $P \ll \nu$, $Q \ll \nu$.
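
As an illustration of the two expressions for the total variation distance in (94), the following minimal sketch compares a brute-force supremum over all events with the half-$L_1$ formula on a small finite alphabet. The probability vectors `p` and `q` and the function names are hypothetical choices made only for this example.

```python
from itertools import combinations

def tv_sup(p, q):
    """Brute-force sup over all events A of |P(A) - Q(A)| on a finite alphabet."""
    idx = range(len(p))
    best = 0.0
    for r in range(len(p) + 1):
        for event in combinations(idx, r):
            gap = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, gap)
    return best

def tv_half_l1(p, q):
    """Half of the L1 distance between the two probability mass functions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical example distributions on a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(tv_sup(p, q), tv_half_l1(p, q))
```

The brute-force version enumerates all $2^S$ events, so it is only meant for tiny alphabets; the half-$L_1$ form is the one to use in practice.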

First we assume that $S \le n^2$. In light of Lemma 11, we construct two measures as follows.

Lemma 12. For any $0 < \eta < 1$ and positive integer $L > 0$, there exist two probability measures $\nu_0$ and $\nu_1$ on $[\eta, 1]$ such that

1) $\int t^l\,\nu_1(dt) = \int t^l\,\nu_0(dt)$, for all $l = 0, 1, 2, \cdots, L$;

2) $\int -\ln t\,\nu_1(dt) - \int -\ln t\,\nu_0(dt) = 2 E_L[-\ln x]_{[\eta,1]}$,

where $E_L[-\ln x]_{[\eta,1]}$ is the distance in the uniform norm on $[\eta, 1]$ from the function $f(x) = -\ln x$ to the space spanned by $\{1, x, \cdots, x^L\}$.


Based on Lemma 12, two new measures $\tilde\nu_0, \tilde\nu_1$ can be constructed as follows: for $i = 0, 1$, the restriction of $\tilde\nu_i$ on $[\eta, 1]$ is absolutely continuous with respect to $\nu_i$, with the Radon-Nikodym derivative given by

$$\frac{d\tilde\nu_i}{d\nu_i}(t) = \frac{\eta}{t}, \quad t \in [\eta, 1], \tag{95}$$

and $\tilde\nu_i(\{0\}) = 1 - \tilde\nu_i([\eta, 1]) \ge 0$. Hence, $\tilde\nu_0, \tilde\nu_1$ are both probability measures on $[0, 1]$, with the following properties:

1) $\int t\,\tilde\nu_1(dt) = \int t\,\tilde\nu_0(dt) = \eta$;

2) $\int t^l\,\tilde\nu_1(dt) = \int t^l\,\tilde\nu_0(dt)$, for all $l = 2, \cdots, L + 1$;

3) $\int -t \ln t\,\tilde\nu_1(dt) - \int -t \ln t\,\tilde\nu_0(dt) = 2\eta E_L[-\ln x]_{[\eta,1]}$.

The construction of the measures $\tilde\nu_0, \tilde\nu_1$ is inspired by Wu and Yang [13].
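
The effect of the density $\eta/t$ can be checked on a toy example. The sketch below assumes two hypothetical discrete measures on $[\eta, 1]$ whose moments match only up to order $L = 1$; they are not the extremal measures of Lemma 12, so property 3) is verified in the form "difference after tilting equals $\eta$ times the difference of $\int -\ln t$ before tilting".

```python
import math

eta = 0.1
# Hypothetical toy measures on [eta, 1], stored as {atom: mass}.
# Both are probability measures with mean 0.55, i.e. moments match for l = 0, 1.
nu0 = {0.55: 1.0}
nu1 = {0.1: 0.5, 1.0: 0.5}

def tilt(nu, eta):
    """Apply the density eta/t on [eta, 1] and place the remaining mass at 0."""
    tilted = {t: mass * eta / t for t, mass in nu.items()}
    tilted[0.0] = 1.0 - sum(tilted.values())
    return tilted

def integrate(nu, f):
    """Integral of f against a discrete measure."""
    return sum(mass * f(t) for t, mass in nu.items())

def neg_t_ln_t(t):
    return 0.0 if t == 0.0 else -t * math.log(t)

tilted0, tilted1 = tilt(nu0, eta), tilt(nu1, eta)

mean0 = integrate(tilted0, lambda t: t)        # property 1: equals eta
mean1 = integrate(tilted1, lambda t: t)
second0 = integrate(tilted0, lambda t: t * t)  # property 2 with l = L + 1 = 2
second1 = integrate(tilted1, lambda t: t * t)
gap_before = integrate(nu1, lambda t: -math.log(t)) - integrate(nu0, lambda t: -math.log(t))
gap_after = integrate(tilted1, neg_t_ln_t) - integrate(tilted0, neg_t_ln_t)
print(mean0, mean1, second0, second1, gap_after, eta * gap_before)
```

Tilting shifts every matched moment up by one order, which is why the first moment is pinned to $\eta$ and the matching range becomes $l = 2, \cdots, L + 1$.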
The following lemma characterizes the properties of $E_L[-\ln x]_{[\eta,1]}$.

Lemma 13. If $K \ge eL^2$, there exists a universal constant $D_0 \ge 1$ such that

$$E_L[-\ln x]_{[(D_0 K)^{-1}, 1]} \gtrsim \ln\left(\frac{K}{L^2}\right).$$


Define

$$L = d_2 \ln n, \quad \eta = \frac{nH}{d_2^2 D_0 S \ln S \ln n}, \tag{96}$$

$$M = \frac{d_1 H}{\ln S} \cdot \frac{1}{S\eta} = \frac{d_1 d_2^2 D_0 \ln n}{n}, \tag{97}$$

with universal positive constants $d_1 \in (0, e^{-1}]$, $d_2 > 2$ to be determined later. Without loss of generality we assume that $d_2 \ln n$ is always a positive integer. Due to $S \ln S \ge e n H \ln n$, we have $(D_0 \eta)^{-1} \ge eL^2$, thus Lemma 13 yields

$$E_L[-\ln x]_{[\eta,1]} \gtrsim \ln\left(\frac{1}{D_0 \eta L^2}\right) = \ln\left(\frac{S \ln S}{nH \ln n}\right). \tag{98}$$
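
The bookkeeping behind this parameter choice can be sanity-checked numerically. The sketch below assumes the definitions read as $L = d_2 \ln n$, $\eta = nH / (d_2^2 D_0 S \ln S \ln n)$ and $M = d_1 d_2^2 D_0 \ln n / n$; the numerical values of $S$, $n$, $H$, $d_1$ and $D_0$ are hypothetical and chosen only so that $S \ln S \ge e n H \ln n$ holds.

```python
import math

def parameters(S, n, H, d1, d2, D0):
    """Return (L, eta, M) under the assumed reading of the definitions."""
    L = d2 * math.log(n)
    eta = n * H / (d2 ** 2 * D0 * S * math.log(S) * math.log(n))
    M = d1 * d2 ** 2 * D0 * math.log(n) / n
    return L, eta, M

# Hypothetical values; D0 in particular is an unspecified universal constant.
S, n, H = 10 ** 6, 10 ** 4, 1.0
d1, d2, D0 = 0.1, 10 * math.e, 2.0
L, eta, M = parameters(S, n, H, d1, d2, D0)

first_moment = eta * M                               # should equal d1 * H / (S ln S)
ratio = 1.0 / (D0 * eta * L ** 2)                    # argument of the logarithm
target = S * math.log(S) / (n * H * math.log(n))
print(first_moment, ratio, target)
```

Since `ratio` equals $S \ln S / (nH \ln n)$ exactly, the requirement $(D_0\eta)^{-1} \ge eL^2$ of Lemma 13 is the same statement as $S \ln S \ge e n H \ln n$.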

Let $g(x) = Mx$ and let $\mu_i$ be the measures on $[0, M]$ defined by $\mu_i(A) = \tilde\nu_i(g^{-1}(A))$ for $i = 0, 1$. It then follows that

1) $\int t\,\mu_1(dt) = \int t\,\mu_0(dt) = d_1 H / (S \ln S)$;

2) $\int t^l\,\mu_1(dt) = \int t^l\,\mu_0(dt)$, for all $l = 2, \cdots, L + 1$;

3) $\int -t \ln t\,\mu_1(dt) - \int -t \ln t\,\mu_0(dt) = 2\eta M E_L[-\ln x]_{[\eta,1]}$.

Let $\mu_0^{S-1}$ and $\mu_1^{S-1}$ be product priors which we assign to the length-$(S-1)$ vector $(p_1, p_2, \cdots, p_{S-1})$, and we set $p_S = d_1(1 - H/\ln S)$. With a little abuse of notation, we still denote the overall product measures by $\mu_0^S$ and $\mu_1^S$. Since $P$ may not be a probability distribution, we consider the set of approximate probability vectors

$$\mathcal{M}_S(\epsilon, H) = \left\{ P : \left| \sum_{i=1}^S p_i - d_1 \right| \le \epsilon,\ H(P) \le H,\ p_i \ge 0\ (1 \le i \le S) \right\} \tag{99}$$

with parameter $\epsilon > 0$ to be specified later, and further define, under the Poissonized model,

$$R_P(S, n, H, \epsilon) = \inf_{\hat H} \sup_{P \in \mathcal{M}_S(\epsilon, H)} \mathbb{E}_P |\hat H - H(P)|^2.$$

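Lemma 14. For any $S, n \in \mathbb{N}$ and $0 < \epsilon < d_1$, we have

$$R(S, n, H) \ge \frac{1}{2 d_1^2} R_P\left(S, \frac{2n}{d_1}, (d_1 - \epsilon)(H - \ln(d_1 + \epsilon)), \epsilon\right) - (\ln S)^2 \exp\left(-\frac{n}{4}\right) - \frac{\epsilon^2}{d_1^2} \cdot \sup_{x \in [d_1 - \epsilon, d_1 + \epsilon]} \ln^2(ex).$$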

In light of Lemma 14, it suffices to consider $R_P(S, n, H, \epsilon)$ to give a lower bound of $R(S, n, H)$. Denote

2d\H „ r , 


( 100 ) 


x ± E ^sH(P) - E^P) = 2 V ME L [- lna;] [77il] • S = 


In S 


and 


Denote by $\pi_i$ the conditional distribution defined as


Fi(A) = 


Fi (Ej n A) 


i = 0 , 1 . 


Now consider $\pi_0, \pi_1$ as two priors. By setting


< = E„|H(P) + f, .= £, e = jL, 


we have $\beta_0 = \beta_1 = 0$ in Lemma 11. Applying the union bound yields that

s 


[( m < 


i =i 


> e 


Fi 


X 
4 J 


_ H / SlnS \ 

- InS ^ \ nH Inn/ 

( 102 ) 

i = 0 , 1 . 

(103) 


(104) 


(105) 

+ p?[H(P)>H} 

(106) 


and the Chebyshev inequality tells us that


Fi 


J2 p i~ dl 


i=i 


> e 


1 S 

< ^^ Var Mffe) ^ 


SM 2 S(lnn ) 4 (ln?z ) 4 


i=i 

s 


,o — 1 

n 2 n 2 


—-T~~— —^ 0 

XI ^ 16^ w __ , _ ^ 16S(MlnM) 2 S(lnS) 2 (lnn) 4 (Inn) 6 


iff(p) -E Mf H(p)i > f < -E Var ,f(-^ ln ^') < 

^ J X. 


-< 


3=1 


X 


(107) 

0 (108) 


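where we have used our assumption that $S \le n^2$. For bounding $\mu_i^S[H(P) > H]$, we first remark that for $d_1 \le e^{-1}$,

$$\mathbb{E}_{\mu_i^S} H(P) \le -d_1 \ln d_1 + (S - 1) \int -t \ln t\,\mu_i(dt) \tag{109}$$

$$\le -d_1 \ln d_1 - S \ln(\eta M) \int t\,\mu_i(dt) \tag{110}$$

$$= -d_1 \ln d_1 + \frac{d_1 H}{\ln S} \ln\left(\frac{S \ln S}{d_1 H}\right) \tag{111}$$

$$= -d_1 \ln d_1 + d_1 H - \frac{d_1 H}{\ln S} \ln\left(\frac{d_1 H}{\ln S}\right) \tag{112}$$

$$\le d_1 H - 2 d_1 \ln d_1, \tag{113}$$

hence, for $d_1$ sufficiently small, say, $d_1 \le \min\{\frac{1}{4}, f^{-1}(\min\{\frac{H_0}{8}, \frac{1}{4}\})\}$, where $f(x) = -x \ln x$ is defined on $[0, e^{-1}]$ and $f^{-1}(\cdot)$ denotes the inverse function of $f(\cdot)$, we have

$$\mathbb{E}_{\mu_i^S} H(P) \le d_1 H - 2 d_1 \ln d_1 \le \frac{H}{4} + 2 \cdot \min\left\{\frac{H_0}{8}, \frac{1}{4}\right\} \le \frac{H_0 + H}{4} \le \frac{H}{2}.$$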


Hence, similar to (108), we have

f! [H(P) >H}< f? 


\H(P)-E„sH(P)\ > f 


^ S(M In M ) 2 ^ S(lnn ) 4 ^ (Inn ) 4 

- {HI 2)2 - ~^r~ - °- 


(114) 


(115) 


Denote by $F_i, G_i$ the marginal probabilities under the priors $\pi_i$ and $\mu_i^S$, respectively, for $i = 0, 1$. In light of (106), (107), (108) and (115), we have

Moreover, by setting 




it was shown in mi Lem. 11] that 


d^Do 


d,2 = lOe 




0 . 


(116) 


(117) 


(118) 


n ns 

Hence, the total variation distance is upper bounded by

$$V(F_0, F_1) \le V(F_0, G_0) + V(G_0, G_1) + V(G_1, F_1) \to 0, \tag{119}$$

where we have used the triangle inequality for the total variation distance. The idea of converting the approximate priors $\mu_i^S$ into priors $\pi_i$ via conditioning comes from Wu and Yang [13].
Now it follows from Lemma 11 and Markov's inequality that

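$$R_P(S, n, H, \epsilon) \ge s^2 \inf_{\hat H} \sup_{P \in \mathcal{M}_S(\epsilon, H)} \mathbb{P}\left(|\hat H - H(P)| \ge s\right) \gtrsim \chi^2 \gtrsim \left[\frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH \ln n}\right)\right]^2, \tag{120}$$

and the desired result follows directly from Lemma 14. Hence we have obtained the desired lower bound in the case $S \le n^2$.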


For general $S > n^2$, the non-decreasing property of $R(n, S, H)$ with respect to $S$ shows that

$$R(n, S, H) \ge R(n, n^2, H) \gtrsim \left.\left[\frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH \ln n}\right)\right]^2\right|_{S = n^2} \asymp H^2, \tag{121}$$

which matches the order of the desired lower bound $\left[\frac{H}{\ln S} \ln\left(\frac{S \ln S}{nH \ln n}\right)\right]^2$.


VII. Future work 

This paper studies the adaptive estimation framework to strengthen the optimality properties of the approximation theoretic entropy estimator proposed in Jiao et al. We remark that the techniques in this paper are by no means constrained to entropy, and we believe analogous results are also true for the estimators of $F_\alpha(P) = \sum_{i=1}^S p_i^\alpha$. Furthermore, we find the fact that the sample size enlargement effect still holds in the adaptive estimation setting very intriguing, and we believe there is a larger picture surrounding this theme to be explored.
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Appendix A 
Auxiliary Lemmas 

The following lemma characterizes the performance of the best uniform approximation polynomial for $-x \ln x$, $x \in [0, 1]$.

Lemma 15. Denote by $\sum_{k=0}^K g_{K,k} x^k$ the $K$-th order best uniform approximation polynomial for $-x \ln x$, $x \in [0, 1]$. Then for $p_K(x) = \sum_{k=1}^K g_{K,k} x^k$, we have the norm bound

$$\sup_{x \in [0,1]} |p_K(x) - (-x \ln x)| \le \frac{D_n}{K^2}, \tag{122}$$

where $D_n > 0$ is a universal constant for the norm bound. In fact, the following inequality holds:

$$\limsup_{K \to \infty} K^2 \cdot \sup_{x \in [0,1]} |p_K(x) - (-x \ln x)| \le \nu_1(2) \approx 0.453, \tag{123}$$

where the function $\nu_1(p)$ was introduced by Ibragimov [20] as the following limit, for $p$ a positive even integer and $m$ a positive integer:

$$\lim_{n \to \infty} \frac{n^p}{(\ln n)^{m-1}} E_n\left[|x|^p \ln^m |x|\right]_{[-1,1]} = \nu_1(p). \tag{124}$$

Furthermore, we also have the pointwise bound: there exists a universal constant $D_p > 0$ such that for any $C \ge 1$,

$$|p_K(x) - 2 \ln K \cdot x| \le D_p C x, \quad \forall x \in \left[0, \frac{C}{K^2}\right]. \tag{125}$$


Lemma 16. [21, Thm. 8.4.8] There exists some universal constant $M > 0$ such that for any order-$n$ polynomial $p(x)$ on $[0, 1]$, we have

$$\sup_{x \in [0,1]} |p(x)| \le M \cdot \sup_{x \in [n^{-2}, 1 - n^{-2}]} |p(x)|. \tag{126}$$

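The following lemma gives some tail bounds for Poisson and Binomial random variables.

Lemma 17. [22, Exercise 4.7] If $X \sim \mathrm{Poi}(\lambda)$, or $X \sim B(n, \frac{\lambda}{n})$, then for any $\delta > 0$, we have

$$P(X \ge (1 + \delta)\lambda) \le \left(\frac{e^{\delta}}{(1 + \delta)^{1+\delta}}\right)^{\lambda}, \tag{127}$$

$$P(X \le (1 - \delta)\lambda) \le \left(\frac{e^{-\delta}}{(1 - \delta)^{1-\delta}}\right)^{\lambda} \le e^{-\delta^2 \lambda / 2}. \tag{128}$$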
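
The Chernoff-type tail bounds for a Poisson variable can be compared with exact tail probabilities. The following sketch assumes the standard forms $\left(e^{\delta}/(1+\delta)^{1+\delta}\right)^{\lambda}$ for the upper tail and $\left(e^{-\delta}/(1-\delta)^{1-\delta}\right)^{\lambda} \le e^{-\delta^2\lambda/2}$ for the lower tail (with $0 < \delta < 1$); the values of `lam` and `delta` are hypothetical.

```python
import math

def poisson_pmf(lam, j):
    return math.exp(-lam) * lam ** j / math.factorial(j)

def poisson_upper_tail(lam, k):
    """Exact P(X >= k) for X ~ Poi(lam) and a nonnegative integer k."""
    return 1.0 - sum(poisson_pmf(lam, j) for j in range(k))

def poisson_lower_tail(lam, k):
    """Exact P(X <= k) for X ~ Poi(lam) and a nonnegative integer k."""
    return sum(poisson_pmf(lam, j) for j in range(k + 1))

def chernoff_upper(lam, delta):
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** lam

def chernoff_lower(lam, delta):
    return (math.exp(-delta) / (1 - delta) ** (1 - delta)) ** lam

lam, delta = 20.0, 0.5  # hypothetical example values
upper_exact = poisson_upper_tail(lam, math.ceil((1 + delta) * lam))
lower_exact = poisson_lower_tail(lam, math.floor((1 - delta) * lam))
upper_bound = chernoff_upper(lam, delta)
lower_bound = chernoff_lower(lam, delta)
gaussian_like = math.exp(-delta ** 2 * lam / 2)
print(upper_exact, upper_bound, lower_exact, lower_bound, gaussian_like)
```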


Appendix B 
Proof of Lemmas 

A. Proof of Lemma [j] 

Denote $H(P) = \sum_{i=1}^S -p_i \ln p_i$ by $H$. We construct the Lagrangian

$$\mathcal{L} = \sum_{i=1}^S p_i (\ln p_i)^2 + \lambda \left(\sum_{i=1}^S -p_i \ln p_i - H\right) + \mu \left(\sum_{i=1}^S p_i - 1\right). \tag{129}$$

Taking the derivative with respect to $p_i$, we obtain that

$$\frac{\partial \mathcal{L}}{\partial p_i} = (\ln p_i)^2 + 2 \ln p_i - \lambda(1 + \ln p_i) + \mu \tag{130}$$

is a quadratic form of $\ln p_i$, so the equation $\frac{\partial \mathcal{L}}{\partial p_i} = 0$ has at most two solutions.

Hence, we conclude that the components of the maximum achieving distribution can only take two values $p_i \in \{q_1, q_2\}$, and suppose $q_1$ appears $m$ times. Then our objective function becomes

$$\sum_{i=1}^S p_i (\ln p_i)^2 = \left(\sum_{i=1}^S p_i (\ln p_i)^2\right) \left(\sum_{i=1}^S p_i\right) \tag{131}$$

$$= \left(\sum_{i=1}^S -p_i \ln p_i\right)^2 + \sum_{1 \le i < j \le S} p_i p_j (\ln p_i - \ln p_j)^2 \tag{132}$$

$$= H^2 + m(S - m) q_1 q_2 (\ln q_1 - \ln q_2)^2. \tag{133}$$

We distinguish two cases.

1) Case I: If $\min\{q_1, q_2\} \ge \frac{1}{S^2}$, we have $-\ln p_i \le 2 \ln S$ for all $i$. Hence,

$$\sum_{i=1}^S p_i (\ln p_i)^2 \le 2 \ln S \cdot \sum_{i=1}^S -p_i \ln p_i = 2 H \ln S. \tag{134}$$

2) Case II: If one of $q_1$ and $q_2$ is smaller than $\frac{1}{S^2}$, without loss of generality we can assume that $q_1 < \frac{1}{S^2}$. Then

$$\sum_{i=1}^S p_i (\ln p_i)^2 = H^2 + m(S - m) q_1 q_2 (\ln q_1 - \ln q_2)^2 \tag{135}$$

$$\le H^2 + m(S - m) q_1 q_2 (\ln q_1)^2 \tag{136}$$

$$\le H^2 + S q_1 (\ln q_1)^2 \tag{137}$$

$$\le H^2 + S \cdot \frac{1}{S^2} \left(\ln \frac{1}{S^2}\right)^2 \tag{138}$$

$$= H^2 + \frac{4 (\ln S)^2}{S}, \tag{139}$$

where we have used the inequalities $m \le S$, $(S - m) q_2 \le 1$ and the monotonically increasing property of $x (\ln x)^2$ for $x \in [0, e^{-2}]$ (note that $\frac{1}{S^2} \le e^{-2}$ whenever $S \ge 3$). Then the lemma is proved by noticing that $H \le \ln S$.
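
The algebraic identity behind (131)-(133), namely $\sum_i p_i (\ln p_i)^2 = H^2 + \sum_{i<j} p_i p_j (\ln p_i - \ln p_j)^2$ for a probability vector, is easy to confirm numerically. The sketch below checks it on a random distribution and on a two-valued distribution; the alphabet size, the seed and the values `q1`, `m` are hypothetical.

```python
import math
import random

def second_log_moment(p):
    """sum_i p_i (ln p_i)^2."""
    return sum(x * math.log(x) ** 2 for x in p)

def entropy(p):
    return -sum(x * math.log(x) for x in p)

def pairwise_form(p):
    """H^2 + sum_{i<j} p_i p_j (ln p_i - ln p_j)^2."""
    cross = sum(
        p[i] * p[j] * (math.log(p[i]) - math.log(p[j])) ** 2
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )
    return entropy(p) ** 2 + cross

random.seed(0)
weights = [random.random() + 1e-3 for _ in range(8)]
p_random = [w / sum(weights) for w in weights]

# Two-valued distribution: q1 appears m times, q2 appears S - m times.
S, m, q1 = 10, 3, 0.001
q2 = (1.0 - m * q1) / (S - m)
p_two = [q1] * m + [q2] * (S - m)
two_valued_form = entropy(p_two) ** 2 + m * (S - m) * q1 * q2 * (math.log(q1) - math.log(q2)) ** 2
print(second_log_moment(p_random), pairwise_form(p_random))
print(second_log_moment(p_two), two_valued_form)
```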

B. Proof of Lemma 7

The lower bound follows directly from the concavity of $-x \ln x$, $0 \le x \le 1$. For the upper bound,

$$p \ln n - \mathbb{E}[-\hat p \ln \hat p] = \mathbb{E}[\hat p \ln(n \hat p)]$$


= E 


k In k 


fc=l 


p > — I — P \ p > 


k + 1 


= E 


k In k „ 


k =1 


^E 


k In c„ 


fc=i 


p = 


P = ~ 
n 


cincjl 

n 


(K) + E 


A: In k — (k — 1) ln(A; — 1) 


n 


k=c +1 

clnc^/^ c\ EE A: In k — (k — 1) ln(/c — 1) 


n 




< p In c + 


cine. 




/c=c+l 

A: In k — (A: — 1) In (A; — 1) 


n 


k=c -\-1 


The Chernoff bound yields

$$P(n \hat p \ge (1 + \delta) n p) \le \left(\frac{e^{\delta}}{(1 + \delta)^{1+\delta}}\right)^{np} \le \left(\frac{e}{1 + \delta}\right)^{(1+\delta) n p},$$

hence for any integer $k$, we have



(140) 


(141) 

% 

IV 

1 ?s- 

(142) 

\ n J 


77 

Al 

(143) 

\ U J 

(144) 


$$P\left(\hat p \ge \frac{k}{n}\right) = P(n \hat p \ge k) \le \left(\frac{enp}{k}\right)^k.$$

Then the desired result follows directly from the fact that

$$k \ln k - (k - 1) \ln(k - 1) = \ln k + (k - 1) \ln\left(1 + \frac{1}{k - 1}\right) \le \ln k + 1.$$

C. Proof of Lemma 9

For the bias, it is straightforward to see that for $nX \sim \mathrm{Poi}(np)$, we have

$$\mathbb{E} S_{K,H}(X) + p \ln p = \sum_{k=1}^K g_{K,k} (4\Delta)^{-k+1} p^k + p \ln p$$

= 4A 
= 4A 



( p ^ 

Pk 

La) 



Pk 

La) 


ln(4A) 


P_ 
4AJ 
P ' 
4A. 


- plnp 


4c i 


-plnp 


(145) 


(146) 


(147) 


(148) 

(149) 

(150) 


where pk{x) = Y^k=i 9K,HX k is the best approximating polynomial appearing in Lemma 15 Since ^ < Ay, Lemma 
asserts that 


15 






and we conclude that 

|E Sk,h(X) + plnp\ < —p\n{pn\wn) + ( D p + ln(4ci/c2)) p. 

The proof for the second part is similar to HD Lem. 5]. 


(151) 


(152) 


D. Proof of Lemma 13 


By defining 


/n{x) = - In (EE + ’ - 1 < x < 1 


(153)

we have $E_L[f_N]_{[-1,1]} = E_L[-\ln x]_{[N^{-1},1]}$. Let $\Delta_L(x) = \frac{1}{L}\sqrt{1 - x^2} + \frac{1}{L^2}$ and define the following modulus of continuity for $f$:

$$\tau_1(f, \Delta_L) = \sup\{|f(x) - f(y)| : x, y \in [-1, 1],\ |x - y| \le \Delta_L(x)\}. \tag{154}$$

We have the following lemma.

Lemma 18. There are an upper bound and a lower bound for ti(/jv, A/,): 


ln ( 2 ^)- Tl(/w ’ Ai) - ln (^ 


VL < 


-/AT 

IF 


(155) 


Proof: The upper bound is shown in [15, Lem. 4]. For the lower bound, denote by $x_L \in [-1, 1]$ the solution to the equation $x_L - \Delta_L(x_L) = -1$; we have the following closed-form formula:


L 2 - L 4 + V-3L 2 + L 4 „ 1 

+ 


(156) 


Hence, by definition, we have 

Ti(fN,A L ) > \f N (x L ) - f n{ 1)| = In 


£i±i w + i^i)> 1 „(££±i J v)>,„( i L). (157) 


The relationship between $\tau_1(f_N, \Delta_L)$ and $E_L[f_N]_{[-1,1]}$ was shown in [23, Thm. 3.13, Thm. 3.14]: there exist two universal constants $M_1, M_2 > 0$ such that

$$E_n[f_N]_{[-1,1]} \le M_1 \tau_1(f_N, \Delta_n), \tag{158}$$

$$\frac{1}{n} \sum_{k=0}^n E_k[f_N]_{[-1,1]} \ge M_2 \tau_1(f_N, \Delta_n). \tag{159}$$

Applying (158) and (159) and setting the approximation order to be $DL$ with a constant $D > 1$ to be specified later, then given $N = (10D)^2 K \ge (10DL)^2$, the non-increasing property of $E_n[f_N]_{[-1,1]}$ with respect to $n$ yields


-El [/tv] [- 1 , 1 ] > 


1 


DL 


> 


DL-L 
1 


y, E n [f N \_ i ;1 


n=L+l 
DL 


DL 


Enl/jvlt-!,!] ^ E 0 [/jv][- 1 , 1 ] - E n [/jy] [—!,!] 


\n—0 


> M 2 T 1 (f N , A dl ) - ^ n(/tV) A n ) 

n—1 

w , / A \ In A iF, / 2A 

V2 (DL) 2 ) DL DL ^ 

, / A \ In A Mi F, /2A\ , 

> M 2 In [ I -„ ,--^r / In f —s- ] dx 


n- 


\2{DL) 2 J DL DL J 1 \x 2 J 
/ 50 A \ lnA + 21n(10L>) Mi /200e 2 A 2 A 

- 2 n VFF/ dl Ld [ u \ L 

Hence, there exists a sufficiently large constant $D > 0$ such that

$$E_L[-\ln x]_{[(100 D^2 K)^{-1}, 1]} = E_L[f_N]_{[-1,1]} \gtrsim \ln\left(\frac{K}{L^2}\right),$$

and this lemma is proved by setting $D_0 = \max\{100 D^2, 1\}$.


(160) 

(161) 

(162) 

(163) 

(164) 

(165) 

(166) 


E. Proof of Lemma 14

Fix $\delta > 0$. Let $\hat H(Z)$ be a near-minimax estimator of $H(P)$ under the Multinomial model. The estimator $\hat H(Z)$ obtains the number of samples $n$ from the observation $Z$. By definition, we have

$$\sup_{P \in \mathcal{M}_S(H)} \mathbb{E}_{\mathrm{Multinomial}} |\hat H(Z) - H(P)|^2 < R(S, n, H) + \delta, \tag{167}$$

where $R(S, n, H)$ is the minimax $L_2$ risk under the Multinomial model. Note that for any vector $P \in \mathcal{M}_S(\epsilon, H)$ ($P$ is not necessarily a probability distribution), we have
/ s \ 


H 


P 


sr^S I sr^ S 

, Lji= 1 Pi ) Pi 


V i=l , 


d\ — e 


(168) 


where by definition we have 


di 


J2i=iPi — d\ < e. Hence, given P £ Ms{e,H), let Z = [Zi,--- , Zs] T with Z j: ~ Poi (npt) 

^H( Z) — lndi^j to estimate H(P). Note that 

+ X! Pi ln X! Pi ) ~ dl ln dl 


and let n! = JL =1 ~ P°i(n Pi)> < 168) suggests to use the estimator d\ 

P 


(H{ Z) - lndi) - H(P) = di H{ Z) - H 


Xi=lPi, 


(169) 


^i=l 


s.i=l 


the triangle inequality gives (define A = sup a . e [ dl _ £)dl+e ] ln 2 (ea;)) 

P 


^Ep 


di (ff(Z)-lndi) - H(P) 


< d 1 Ep 


H(Z)-H 


Xi=lPi, 


K i= 1 


vi =1 


— ^1 ^ y Ep 


m—0 
00 


. ( p \ 

2 


H Z H q 

VEf=i Pi) 


n' = m 




J ln | y ^Pi I - dl lnd-, 

n! = to) + e 2 A 


< di R (s, 


m —0 


H 


,m, 


di — e 


+ ln(di + e) ) P (n! = m) + 6 + e 2 A 


(170) 

(171) 

(172) 


< d\R \S, 

< d\R [S, 


+ ln(di + e) ) P(n / > ) + (di lnS) 2 P(n r < ) + 6 + e 2 ^4 

(173) 

(174) 


+ ln(di + e) ) + (di ln S) 2 exp(— ) + 6 + e 2 A 


din H 
2 1 di — e 

din H 
2 1 di — e 

where we have used the fact that conditioned on n! = to, Z ^ Multinomialfm, ), and R(S, n, H) < (suppg^s -H^A )) 2 = 
(ln S) 2 . Moreover, the last step follows from Lemma 17 The proof is completed by the arbitrariness of 5 and Lemma [l] 
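
The reduction from approximate probability vectors to genuine distributions rests on the scaling identity $H(P/s) = H(P)/s + \ln s$ with $s = \sum_i p_i$, where $H(P) = \sum_i -p_i \ln p_i$ is evaluated on the unnormalized vector. A minimal numerical check, using a hypothetical nonnegative vector `P` whose entries sum to roughly $d_1 = 0.1$:

```python
import math

def entropy_functional(p):
    """sum_i -p_i ln p_i, also defined for unnormalized nonnegative vectors."""
    return sum(-x * math.log(x) for x in p if x > 0.0)

# Hypothetical approximate probability vector (entries sum to about 0.1).
P = [0.03, 0.02, 0.025, 0.015, 0.012]
s = sum(P)
normalized = [x / s for x in P]

lhs = entropy_functional(normalized)
rhs = entropy_functional(P) / s + math.log(s)
print(lhs, rhs)
```

In particular, if $|s - d_1| \le \epsilon$ and $H(P) \le H$, the normalized vector has entropy at most $H/(d_1 - \epsilon) + \ln(d_1 + \epsilon)$, which is the entropy level appearing in the Multinomial risk in this proof.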

F. Proof of Lemma 15

It has been shown in tm Lemma 18] that 


$$\lim_{K \to \infty} K^2 \cdot \sup_{x \in [0,1]} \left| \sum_{k=0}^K g_{K,k} x^k - (-x \ln x) \right| = \frac{\nu_1(2)}{2}, \tag{175}$$

then plugging in $x = 0$ yields

$$\limsup_{K \to \infty} K^2 \cdot |g_{K,0}| \le \frac{\nu_1(2)}{2}. \tag{176}$$

Hence, it follows from the triangle inequality that

$$\limsup_{K \to \infty} K^2 \cdot \sup_{x \in [0,1]} \left| \sum_{k=1}^K g_{K,k} x^k - (-x \ln x) \right| \le \nu_1(2), \tag{177}$$

which completes the proof of the norm bound.

For the pointwise bound, [21, Thm. 7.3.1] asserts that there exists a universal positive constant $M_1$ such that

$$\sup_{x \in [0,1]} |(\varphi(x))^2 p_K''(x)| \le M_1 K^2 \omega_\varphi^2(-x \ln x, K^{-1}), \tag{178}$$

where $\varphi(x) = \sqrt{x(1 - x)}$, and $\omega_\varphi^2(f, t)$ is the second-order Ditzian-Totik modulus of smoothness [21] defined by

$$\omega_\varphi^2(f, t) = \sup\left\{ \left| f(u) + f(v) - 2 f\left(\frac{u + v}{2}\right) \right| : u, v \in [0, 1],\ |u - v| \le 2 t \varphi\left(\frac{u + v}{2}\right) \right\}. \tag{179}$$

We have

$$\omega_\varphi^2(-x \ln x, t) = \frac{2 t^2 \ln 2}{1 + t^2}, \tag{180}$$

and hence

$$\sup_{x \in [0,1]} |x(1 - x) p_K''(x)| \le 2 M_1 \ln 2. \tag{181}$$

According to Lemma 16, since $p_K''(x)$ is a polynomial with order $K - 2 < 2K$, there exists some positive constant $M_2$ such that

$$\sup_{x \in [0,1]} |p_K''(x)| \le M_2 \sup_{x \in [(2K)^{-2}, 1 - (2K)^{-2}]} |p_K''(x)| \tag{182}$$

$$\le \frac{M_2 (2K)^4}{(2K)^2 - 1} \sup_{x \in [(2K)^{-2}, 1 - (2K)^{-2}]} |x(1 - x) p_K''(x)| \tag{183}$$

$$\le \frac{16 M_2 K^4}{4K^2 - 1} \sup_{x \in [0,1]} |x(1 - x) p_K''(x)| \tag{184}$$

$$\le \frac{32 M_1 M_2 K^4 \ln 2}{4K^2 - 1} \tag{185}$$

$$\le 16 M_1 M_2 K^2 \ln 2, \tag{186}$$

hence for any $x, y \in [0, C/K^2]$, we have

$$|p_K'(x) - p_K'(y)| \le \int_{\min\{x,y\}}^{\max\{x,y\}} |p_K''(t)|\,dt \le 16 M_1 M_2 \ln 2 \cdot K^2 |x - y| \le 16 M_1 M_2 C \ln 2. \tag{187}$$

As a result, we know that for any $C \ge 1$ and $x \in [0, C/K^2]$,

$$16 M_1 M_2 C \ln 2 \ge \frac{K^2}{C} \int_0^{C/K^2} |p_K'(x) - p_K'(t)|\,dt \tag{188}$$

$$\ge \left| p_K'(x) - \frac{K^2}{C} \int_0^{C/K^2} p_K'(t)\,dt \right| \tag{189}$$

$$= \left| p_K'(x) - \frac{K^2}{C} p_K\left(\frac{C}{K^2}\right) \right| \tag{190}$$

$$\ge |p_K'(x) - 2 \ln K| - \ln C - K^2 \sup_{t \in [0,1]} |p_K(t) - (-t \ln t)| \tag{191}$$

$$\ge |p_K'(x) - 2 \ln K| - \ln C - D_n, \tag{192}$$

where $D_n$ is the coefficient of the norm bound in (122). Hence, the universal positive constant

$$D_p = 16 M_1 M_2 \ln 2 + 1 + D_n \tag{193}$$

satisfies

$$|p_K'(x) - 2 \ln K| \le D_p C, \quad \forall x \in \left[0, \frac{C}{K^2}\right], \tag{194}$$

and it follows that

$$|p_K(x) - 2 \ln K \cdot x| \le \int_0^x |p_K'(t) - 2 \ln K|\,dt \le \int_0^x D_p C\,dt = D_p C x. \tag{195}$$
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